September 10, 2018
Thomas Brambor
Email: thomas.brambor@columbia.edu
Office hours: Mondays 5 - 6pm and by appointment.
Location: IAB 509E
TA1: Xinyu Ni
xn2115@tc.columbia.edu
IAB 270 Time TBA
TA2: Mikaela Zhang
xz2782@columbia.edu
IAB 270 Time Wed 10am - 12pm
Water, water, everywhere,
nor any drop to drink.
in The Rime of the Ancient Mariner, by Samuel Taylor Coleridge
Data, Data, everywhere,
nor any thought to think.
random dude on twitter
tidyverse
No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.
Hal Abelson
Import:
- readr: import different kinds of rectangular data (e.g. csv, tsv, fwf) in a fast and friendly way; readxl and xml2 for special types
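As a minimal sketch of the import step, the example below writes a small, made-up CSV to a temporary file and reads it back with `read_csv()`. The column names and values are hypothetical, and the explicit `col_types` specification is optional (readr guesses types otherwise).

```r
library(readr)

# Write a small, hypothetical CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,score,enrolled",
    "Ana,91.5,2018-09-10",
    "Ben,78,2018-09-12"),
  csv_path
)

# read_csv() returns a tibble; col_types makes the column types explicit
# instead of relying on readr's guesses.
students <- read_csv(
  csv_path,
  col_types = cols(
    name = col_character(),
    score = col_double(),
    enrolled = col_date()
  )
)

students
```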
Tidy:
- tidyr: reshape the layout of dataframes into a specific type, the tibble – a tidy data frame
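To illustrate reshaping, here is a small sketch that turns a "wide" table (one column per year) into a "long" one (one row per country and year). The data are invented, and `pivot_longer()` assumes tidyr 1.0.0 or later; older code would use `gather()` for the same purpose.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per country, one column per year.
wide <- tibble(
  country = c("A", "B"),
  `2017` = c(10, 20),
  `2018` = c(11, 22)
)

# pivot_longer() stacks the year columns into a year/value pair of columns.
long <- pivot_longer(
  wide,
  cols = c(`2017`, `2018`),
  names_to = "year",
  values_to = "value"
)

long
```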
Transform:
- dplyr: provides functions to manipulate and transform data frames.
- Includes select, filter, group, summarize, arrange, mutate, join etc.
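The verbs listed above can be chained into one short pipeline. The sketch below uses the built-in `mtcars` data. The weight threshold and the rough miles-per-gallon to kilometers-per-liter factor are arbitrary choices for illustration.

```r
library(dplyr)

# mtcars ships with base R, so no data download is needed.
by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                   # keep three columns
  filter(wt < 4) %>%                         # keep lighter cars only
  mutate(kpl = mpg * 0.425) %>%              # approximate km per liter
  group_by(cyl) %>%                          # one group per cylinder count
  summarize(n = n(), mean_kpl = mean(kpl)) %>%
  arrange(desc(mean_kpl))                    # most efficient group first

by_cyl
```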
Data Types:
- How to work with the different types of data such as numerics, characters (stringr), factors (forcats), and dates (lubridate)
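A brief sketch of what working with these data types can look like, using one function from each of the three packages named above. All values are made up for illustration.

```r
library(stringr)
library(forcats)
library(lubridate)

# Characters: trim whitespace and standardize case with stringr.
raw_names <- c("  alice ", "BOB", "carol")
clean_names <- str_to_title(str_trim(raw_names))

# Factors: reorder levels by how often they occur with forcats.
majors <- factor(c("stats", "econ", "stats", "soc", "stats", "econ"))
majors_by_freq <- fct_infreq(majors)

# Dates: parse month/day/year text into a Date and extract parts with lubridate.
due <- mdy("09/10/2018")
due_month <- month(due)
due_weekday <- wday(due, label = TRUE)
```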
ggplot2
broom
dplyr imports the %>% operator from the magrittr package.
- x %>% f(y) is equivalent to f(x, y).
stringr
Using the httr package (a wrapper for curl) to access some well-known web APIs
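Returning to the pipe equivalence stated above, the following sketch checks it on a small numeric vector. The vector and the choice of functions are arbitrary. The pipe simply passes the left-hand side as the first argument of the call on the right.

```r
library(magrittr)

scores <- c(4, 9, 16)

# x %>% f(y) is f(x, y): the left-hand side becomes the first argument.
piped_head  <- scores %>% head(2)
nested_head <- head(scores, 2)

# Pipes read left to right; the nested form reads inside out.
piped_total  <- scores %>% sqrt() %>% sum()
nested_total <- sum(sqrt(scores))

identical(piped_head, nested_head)
identical(piped_total, nested_total)
```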
But HTML is messy. Will need to select the right elements and clean it up.
Old school way of getting information. Many websites do not allow it anymore (TOS) and/or make it difficult.
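As a rough sketch of "select the right elements and clean it up", the example below parses a small, invented HTML fragment (standing in for a downloaded page) with xml2, selects elements with an XPath expression, and tidies the text with stringr. The class names and contents are hypothetical. A real page would need its own selector.

```r
library(xml2)
library(stringr)

# A hypothetical, messy HTML fragment standing in for a downloaded page.
page <- read_html('
  <html><body>
    <div class="ad">Buy now!</div>
    <ul>
      <li class="course">   Course A
      </li>
      <li class="course">Course   B </li>
      <li class="note">updated weekly</li>
    </ul>
  </body></html>')

# Select only the list items we care about with an XPath expression.
nodes <- xml_find_all(page, "//li[@class='course']")

# Clean up: extract the text and collapse stray whitespace.
courses <- str_squish(xml_text(nodes))

courses
```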
We have big data when the computing time for the calculation takes longer than the cognitive process of designing a model.
Efficiency and structure matter (more).
In class there will be a lot of back and forth between general explanatory material, bits of code, comments to ourselves, and other stuff.
My suggestion: take notes in RStudio using R Markdown notebooks or a simple R Markdown file
For your assignments, this is also a good option.
More info here: http://rmarkdown.rstudio.com/
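To give a feel for how an R Markdown file mixes text and code, here is a minimal sketch that writes a tiny, hypothetical `.Rmd` file to a temporary location and knits it to Markdown with `knitr::knit()`. In RStudio the Knit button (or `rmarkdown::render()`) would normally do this and go on to produce HTML, PDF, or Word. Plain knitr is used here only because it does not require pandoc.

```r
library(knitr)

# Build the chunk delimiter programmatically to keep this example readable.
fence <- strrep("`", 3)

# Write a minimal, hypothetical R Markdown file to a temporary location.
rmd_path <- tempfile(fileext = ".Rmd")
md_path  <- tempfile(fileext = ".md")
writeLines(c(
  "---",
  "title: \"Lecture notes\"",
  "---",
  "",
  "Some explanatory text about the data.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_path)

# knit() runs the chunk and writes Markdown that interleaves text, code, and output.
knit(rmd_path, output = md_path, quiet = TRUE)
notes <- readLines(md_path)

cat(notes, sep = "\n")
```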
PRO:
Reproducible. For others and your later (forgetful) self.
Live document. Combining code and output. Changes to your data or code will immediately update.
Supports numerous static and dynamic output formats, including HTML, PDF, MS Word, Beamer, slides, and Shiny applications.
CON:
Piazza course site for discussion forum and announcements. Make sure to set your notification settings right.
All lecture slides, in-class exercises, homework, code etc. will be made available here: https://github.com/QMSS-G5072-2018/
All homework is submitted via GitHub. Introduction to GitHub next week.
Next week, our GitHub course site will turn into a private repository. Please sign up here to be part of the club.
Please also add yourself to our Piazza course forum here:
Final exam/project (30%)
Assignments (60%): short weekly individual assignments.
Participation & Attendance (10%)